New Haven ESDA

Author

Xingye Zhang

Published

December 3, 2024

1 Introduction

1.1 Research Background

  • Using HPC (High-Performance Computing), I reprocessed and refined building-attribute data for 413 major U.S. cities in the UT-GLOBUS dataset.
  • Using the reprocessed New Haven attribute data as a case study, I conducted ESDA (spatial patterns, spatial autocorrelation, and 3D visualization).

1.2 ESDA

Exploratory Spatial Data Analysis (ESDA) addresses questions that traditional Exploratory Data Analysis (EDA) cannot. For instance, ESDA helps determine whether a variable (or phenomenon) occurs randomly across space or follows a specific spatial pattern. It also enables the identification of spatial clustering—revealing whether a variable is clustered in space and, if so, where these clusters are.

2 Exploratory Spatial Data Analysis: New Haven

2.1 Import Packages

Code
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import contextily as ctx

# Imports from the PySAL ecosystem
from libpysal.weights import KNN, Queen, DistanceBand  # Spatial weights
from esda.moran import Moran, Moran_Local              # Moran's I tests
from splot.libpysal import plot_spatial_weights
from splot.esda import moran_scatterplot, lisa_cluster
from spreg import ML_Lag                               # Spatial lag model

from shapely.geometry import Point
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore")

2.2 Data Preprocessing

Code
# Load the CSV file
data = pd.read_csv('/Users/austin/Downloads/WithCentroids_CSV/New_Haven_with_centroids.csv')

# Filter data to remove extreme height values (height > 0 and height < 100)
data_filtered = data[(data['height'] > 0) & (data['height'] < 100)].copy()

# Create geometry column based on centroid_x and centroid_y
geometry = [Point(xy) for xy in zip(data_filtered['centroid_x'], data_filtered['centroid_y'])]

# Convert to GeoDataFrame
data_filtered = gpd.GeoDataFrame(data_filtered, geometry=geometry, crs="EPSG:4326")
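
A caveat worth keeping in mind: the centroids are stored in EPSG:4326, so the coordinates are degrees, and straight-line distances computed on them are not metric. When distances matter, either reproject to a projected CRS or use great-circle distance. A minimal haversine sketch with made-up coordinates (not the actual dataset) illustrates why:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# One degree of longitude shrinks with latitude: compare the equator
# with New Haven's latitude (~41.3 N)
print(round(haversine_km(0, 0, 1, 0), 1))        # ~111.2 km at the equator
print(round(haversine_km(0, 41.3, 1, 41.3), 1))  # ~83.5 km at 41.3 N
```

For the small extent of a single city the distortion is roughly uniform, so the KNN graphs below are still reasonable, but any distance thresholds in degrees would not translate directly to meters.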

2.3 Descriptive Statistics

Code
# Display summary statistics
print(data_filtered[['height', 'Area', 'Volume', 'Surface']].describe())
              height           Area        Volume        Surface
count  331418.000000  331418.000000  3.314180e+05  331418.000000
mean        7.312011     231.301173  2.073909e+03     719.587940
std         1.803568     747.279371  1.198318e+04    1363.131027
min         2.000000       0.000000  2.000000e+00      13.000000
25%         7.000000     100.000000  6.910000e+02     391.000000
50%         7.000000     164.000000  1.217000e+03     579.000000
75%         8.000000     243.000000  1.971000e+03     817.000000
max        99.000000  132059.000000  2.331433e+06  196479.000000
Code
# Plot histogram
plt.figure(figsize=(10, 6))
sns.histplot(data=data_filtered, x='height', binwidth=1, color='skyblue', edgecolor='black')

# Customize plot
plt.title('Distribution of Building Heights', fontsize=16)
plt.xlabel('Height', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show plot
plt.show()

Code
# Compute and display pairwise correlation
correlation_matrix = data_filtered[['height', 'Area', 'Volume', 'Surface']].corr()

print(correlation_matrix)
           height      Area    Volume   Surface
height   1.000000  0.283269  0.343673  0.460742
Area     0.283269  1.000000  0.861150  0.927005
Volume   0.343673  0.861150  1.000000  0.937878
Surface  0.460742  0.927005  0.937878  1.000000
  • The correlation matrix shows that height has a moderate positive correlation with Area, Volume, and Surface, indicating some relationship but not a strong dependency.

  • Area, Volume, and Surface are highly correlated with each other, reflecting their strong interdependence.
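
The strong Area–Volume–Surface correlations are expected, since volume is roughly footprint area times height. A quick synthetic illustration (toy uniform draws, not the real data) shows the same structure arise mechanically:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
area = rng.uniform(50, 500, n)   # hypothetical footprint areas (m^2)
height = rng.uniform(2, 15, n)   # hypothetical heights (m)
volume = area * height           # volume is a product of the other two

corr = pd.DataFrame({'height': height, 'Area': area, 'Volume': volume}).corr()
print(corr.round(2))
# Area and Volume correlate strongly even though height and Area
# were drawn independently here
```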

2.4 Spatial Analysis

2.4.1 Choropleth Map

Code
# Fill missing values in 'height' column
data_filtered['height'] = data_filtered['height'].fillna(0)

# Plotting the choropleth map
fig, ax = plt.subplots(1, 1, figsize=(10, 10))

data_filtered.plot(
    column='height',  # Column to visualize
    cmap='Oranges',  # Color scheme
    scheme='naturalbreaks',  # Classification scheme
    legend=True,  # Show legend
    ax=ax,  # Plot on the same axes
    legend_kwds={
        'loc': 'lower right',  # Legend location
        'fmt': "{:.0f}",  # Format for legend numbers
    },
    edgecolor='black',  # Add borders to polygons
    linewidth=0.03,  # Border thickness
)

# Add title and labels
ax.set_title('Choropleth Map of Building Heights', fontsize=16)
ax.set_xlabel('Longitude', fontsize=12)
ax.set_ylabel('Latitude', fontsize=12)

# Show plot
plt.show()

  • The buildings are grouped into five categories based on their heights: 2–5 meters, 5–7 meters, 7–9 meters, 9–24 meters, and 24–99 meters.

  • Most regions exhibit lower building heights, suggesting a predominance of low-rise structures across the overall landscape.
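
Once the natural-breaks edges reported above (5, 7, 9, 24) are known, the five classes can be reproduced with a simple binning; a numpy sketch with a few example heights:

```python
import numpy as np

# Natural-breaks class edges reported for New Haven heights (meters)
edges = [5, 7, 9, 24]

heights = np.array([2.5, 6.0, 8.0, 15.0, 60.0])
classes = np.digitize(heights, edges)  # 0 = 2-5 m, ..., 4 = 24-99 m
print(classes)  # [0 1 2 3 4]
```

In the map itself, geopandas delegates this classification to mapclassify via the `scheme='naturalbreaks'` argument.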

2.4.2 Spatial Weights: Queen Contiguity

Code
# Create Queen contiguity spatial weights
queen_weights = Queen.from_dataframe(data_filtered)

# Visualize spatial weights
fig, ax = plt.subplots(figsize=(10, 10))
plot_spatial_weights(queen_weights, data_filtered, ax=ax)
plt.title("Spatial Weights Visualization: Queen Contiguity", fontsize=16)
plt.show()

  • Each point is connected to its immediate neighbors sharing a border or vertex, forming a network of spatial dependencies.
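
For intuition, queen contiguity on a regular grid links every cell to the up-to-eight cells touching it by an edge or a corner. A small self-contained sketch (a toy grid, not the building data):

```python
def queen_neighbors(row, col, nrows, ncols):
    """All grid cells sharing an edge or a vertex with (row, col)."""
    return [(r, c)
            for r in range(max(0, row - 1), min(nrows, row + 2))
            for c in range(max(0, col - 1), min(ncols, col + 2))
            if (r, c) != (row, col)]

print(len(queen_neighbors(1, 1, 3, 3)))  # 8: interior cell of a 3x3 grid
print(len(queen_neighbors(0, 0, 3, 3)))  # 3: corner cell
```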

2.4.3 Spatial Weights: K-Nearest Neighbors

Code
data_filtered = data_filtered.reset_index(drop=True)

# Create KNN spatial weights (k=4)
knn_weights = KNN.from_dataframe(data_filtered, k=4)

# Visualize KNN spatial weights
fig, ax = plt.subplots(figsize=(10, 10))
plot_spatial_weights(knn_weights, data_filtered, ax=ax)
plt.title("Spatial Weights Visualization: KNN (k=4)", fontsize=16)
plt.show()

  • K-Nearest Neighbors is used instead of Queen contiguity because the dataset is point-based (x and y coordinates plus height) without clear polygon boundaries.
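
Conceptually, the KNN graph is built directly from the coordinate array: each point is linked to the k points closest to it. A brute-force numpy sketch on toy points (libpysal's KNN adds spatial indexing for speed; this only shows the idea):

```python
import numpy as np

def knn_neighbors(coords, k):
    """Indices of the k nearest neighbors of each point (brute force)."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise squared Euclidean distances via broadcasting
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]  # k smallest distances per row

pts = [[0, 0], [1, 0], [3, 0], [10, 0]]
print(knn_neighbors(pts, 1).ravel())  # [1 0 1 2]
```

Note that, unlike contiguity, KNN relations need not be symmetric: point 3 picks point 2 as its neighbor, but not vice versa.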

2.4.4 Global Spatial Autocorrelation: Moran’s I

Code
# Centroid coordinates (kept for reference; KNN below works on the GeoDataFrame directly)
coordinates = data_filtered[['centroid_x', 'centroid_y']]

# Create spatial weights matrix using k-nearest neighbors (k=4)
knn_weights = KNN.from_dataframe(data_filtered, k=4)

# Standardize the weights to row-normalized (equivalent to "W" style in R)
knn_weights.transform = 'R'

# Calculate Moran's I for the 'height' variable
moran = Moran(data_filtered['height'], knn_weights)

# Print Moran's I results
print(f"Moran's I: {moran.I}")
print(f"Expected I: {moran.EI}")
print(f"p-value: {moran.p_sim}")
print(f"z-score: {moran.z_sim}")
Moran's I: 0.32124828050255405
Expected I: -3.0173467263296692e-06
p-value: 0.001
z-score: 271.0988808513457
  • Global Moran’s I measures the overall spatial autocorrelation, indicating whether similar values cluster spatially.

  • Result: Moran’s I suggests significant positive spatial autocorrelation, meaning building heights exhibit clustering.
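
In matrix form, the statistic computed above is I = (n/S0) · (zᵀWz)/(zᵀz), where z is the mean-centered variable and S0 the sum of all weights. A hand-rolled check on a tiny example (four sites on a line, not the New Haven data):

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I for values y and a spatial weights matrix W."""
    z = y - y.mean()
    s0 = W.sum()
    return len(y) / s0 * (z @ W @ z) / (z @ z)

# Four sites on a line, binary adjacency between consecutive sites
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0])  # smooth trend -> positive autocorrelation

print(morans_i(y, W))  # 1/3 for this configuration
```

esda's `Moran` additionally computes the expectation and a permutation-based p-value (`p_sim`), which this sketch omits.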

2.4.5 Local Indicators of Spatial Association: Local Moran’s I

Code
# Calculate Local Moran's I
lisa = Moran_Local(data_filtered['height'], knn_weights)

# Print summary statistics
print(f"Number of Significant Clusters: {(lisa.p_sim < 0.05).sum()}")
Number of Significant Clusters: 60765
  • Local Moran’s I identifies clusters and spatial outliers at the individual feature level, allowing us to detect where significant spatial patterns occur within the dataset.
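
Each local statistic decomposes the global one: I_i = (z_i/m2) · Σ_j w_ij z_j with m2 = Σ z²/n, and the local values sum to S0 times the global I. A numpy sketch on the same four-site toy example used for the global statistic (not the building data):

```python
import numpy as np

def local_morans_i(y, W):
    """Local Moran's I_i = (z_i / m2) * sum_j w_ij z_j, m2 = sum(z^2)/n."""
    z = y - y.mean()
    m2 = (z @ z) / len(y)
    return z * (W @ z) / m2

W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0])

Ii = local_morans_i(y, W)
print(Ii)         # one statistic per site: [0.6 0.4 0.4 0.6]
print(Ii.sum())   # ~2.0 = S0 * global I (6 * 1/3)
```

esda's `Moran_Local` adds the quadrant labels (`q`) and permutation p-values (`p_sim`) used in the plots below.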
Code
# Plot the local Moran's I scatterplot
fig, ax = moran_scatterplot(lisa, p=0.05)

# Add four quadrant labels to the plot
plt.text(2, 1, 'HH', fontsize=15, color='red')  # High-High
plt.text(2, -1, 'HL', fontsize=15, color='orange')  # High-Low
plt.text(-2, 1, 'LH', fontsize=15, color='blue')  # Low-High
plt.text(-2, -1, 'LL', fontsize=15, color='green')  # Low-Low

# Set title and display the plot
plt.title("Moran Local Scatterplot", fontsize=16)
plt.show()

  • Most points are concentrated in the high-high (HH) quadrant, indicating areas where taller buildings are spatially clustered together.
Code
# Keep only significant quadrant codes (lisa.q is always 1-4, even for
# non-significant points, so mask by the pseudo p-value first)
data_filtered['lisa_cluster'] = np.where(lisa.p_sim < 0.05, lisa.q, 0)

# Map LISA classification results to descriptive categories
data_filtered['lisa_category'] = data_filtered['lisa_cluster'].map({
    1: 'HH',  # High-High
    2: 'LH',  # Low-High
    3: 'LL',  # Low-Low
    4: 'HL',  # High-Low
    0: 'ns',  # Non-significant
})

# Define custom color mapping for each LISA category
lisa_colors = {
    'HH': '#D7191C',  # Red for High-High
    'HL': '#FDAE61',  # Orange for High-Low
    'LH': '#ABD9E9',  # Light Blue for Low-High
    'LL': '#2C7BB6',  # Blue for Low-Low
    'ns': '#D3D3D3'   # Gray for non-significant points
}

# Plot the LISA classification map
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
data_filtered.plot(
    column='lisa_category',  # Column to display
    cmap=plt.matplotlib.colors.ListedColormap(list(lisa_colors.values())),  
    categories=list(lisa_colors.keys()),  # Specify the categories for the legend
    legend=True,  # Display legend
    ax=ax,  # Plot on the same axis
    edgecolor='k',  # Set border color to black
    linewidth=0.03  # Set border thickness
)

# Add title to the map
ax.set_title('LISA Moran Categories of Building Heights', fontsize=16)

# Show the plot
plt.show()

  • This map aligns with the results of the previous scatterplot, confirming that the majority of the clusters fall under the High-High (HH) category.

2.4.6 Spatial Lag Model

Code
# Take a random 1% sample of the dataset
data_sample = data_filtered.sample(frac=0.01)

# Define dependent variable (height)
y_sample = data_sample['height'].values.reshape(-1, 1)

# Define independent variables (centroid coordinates)
X_sample = data_sample[['centroid_x', 'centroid_y']].values

# Create K-nearest neighbors spatial weights for the sample (k=4)
knn_weights_sample = KNN.from_dataframe(data_sample, k=4)
knn_weights_sample.transform = 'R'  # Row-standardize the weights

# Print summary to verify
print(f"Sample size: {len(data_sample)}")
print(f"KNN weights for the sample: {knn_weights_sample.n}")
Sample size: 3314
KNN weights for the sample: 3314
  • Selecting a 1% sample of the dataset for efficient model testing.
Code
# Add a constant term (spreg adds an intercept itself, hence the
# "removed for being constant" warning in the output below)
X_sample = np.hstack([np.ones((X_sample.shape[0], 1)), X_sample]) 

## Fit the Spatial Lag Model
slm_sample = ML_Lag(
    y_sample, 
    X_sample, 
    w=knn_weights_sample, 
    name_y='height', 
    name_x=['constant', 'centroid_x', 'centroid_y'], 
    name_w='KNN (k=4)', 
    name_ds='Sample of New Haven Attributes Dataset'
)

# Print results
print(slm_sample.summary)
REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :Sample of New Haven Attributes Dataset
Weights matrix      :   KNN (k=4)
Dependent Variable  :      height                Number of Observations:        3314
Mean dependent var  :      7.3120                Number of Variables   :           4
S.D. dependent var  :      1.6631                Degrees of Freedom    :        3310
Pseudo R-squared    :      0.0867
Spatial Pseudo R-squared:  0.0288
Log likelihood      :  -6269.0668
Sigma-square ML     :      2.5361                Akaike info criterion :   12546.134
S.E of regression   :      1.5925                Schwarz criterion     :   12570.557

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        58.12803         9.40069         6.18338         0.00000
          centroid_x         0.76853         0.10763         7.14026         0.00000
          centroid_y         0.07671         0.16620         0.46158         0.64438
            W_height         0.26664         0.02211        12.06069         0.00000
------------------------------------------------------------------------------------
Warning: Variable(s) ['constant'] removed for being constant.

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
          centroid_x         0.7685          0.2794          1.0479
          centroid_y         0.0767          0.0279          0.1046
================================ END OF REPORT =====================================
  • Spatial Dependence: Building heights show clear spatial dependence, influenced by the heights of nearby buildings.

  • Coordinate Effects: The x-coordinate strongly impacts height, while the y-coordinate shows little influence.

  • Model Performance: The model has limited explanatory power, as indicated by the low pseudo R-squared and spatial pseudo R-squared values, possibly because the 1% sample (frac=0.01) is small relative to the entire New Haven area.
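
The model estimated above follows y = ρWy + Xβ + ε, whose reduced form y = (I − ρW)⁻¹(Xβ + ε) shows how each site's value feeds back through its neighbors. A small numpy simulation of this data-generating process (toy ring weights and made-up coefficients, not the fitted model):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, beta = 50, 0.4, np.array([2.0, 0.5])

# Row-standardized ring adjacency: each site neighbors the two adjacent sites
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5

X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(scale=0.1, size=n)

# Reduced form: y = (I - rho*W)^-1 (X beta + eps)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)

# The simulated y satisfies the structural equation y = rho*W*y + X*beta + eps
print(np.allclose(y, rho * W @ y + X @ beta + eps))  # True
```

Because ρ enters through a matrix inverse, OLS on the lagged outcome is inconsistent, which is why spreg's ML_Lag estimates the model by maximum likelihood.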

2.4.7 3D Visualization of Building Heights

Code
import plotly.graph_objects as go

# Prepare data
x = data_filtered['centroid_x']
y = data_filtered['centroid_y']
z = data_filtered['height']

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=x,  # Longitude (centroid_x)
    y=y,  # Latitude (centroid_y)
    z=z,  # Height
    mode='markers',
    marker=dict(
        size=3,
        color=z,  # Color by height
        colorscale='Viridis',  # Colormap
        showscale=True,  # Show colorbar
        colorbar=dict(title='Height (m)')
    )
)])

# Update layout for better visualization
fig.update_layout(
    title='3D Visualization of Building Heights',
    scene=dict(
        xaxis_title='Longitude',
        yaxis_title='Latitude',
        zaxis_title='Height',
    ),
    margin=dict(l=0, r=0, b=0, t=40)
)

# Show the plot
fig.show()

3 References

  1. Kazumatsuda, “Exploratory Spatial Data Analysis (ESDA): Spatial Autocorrelation”

  2. GLObal Building heights for Urban Studies (UT-GLOBUS) for city- and street- scale urban simulations: Development and first applications